Abstract
Background: Intraoperative bleeding is a critical event that impacts surgical safety and patient outcomes. Machine learning (ML) has demonstrated potential in prediction tasks, yet its methodological rigor and clinical translation face challenges.
Objective: This scoping review aims to systematically synthesize the current state of development, performance, and validation of ML models for predicting intraoperative bleeding, and to identify key barriers to their clinical implementation.
Methods: Following the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines, we systematically searched 7 databases (PubMed, Web of Science, Embase, CINAHL, CNKI [China National Knowledge Infrastructure], Wanfang, and VIP [China Science and Technology Journal Database]) from their inception to April 2025. Two reviewers (SY and PZ) independently screened studies, extracted data using the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies (CHARMS), and assessed the risk of bias using the Prediction Model Risk Of Bias Assessment Tool (PROBAST). A narrative synthesis was used for data analysis.
Results: Of 2651 records identified, 23 studies were included (sample sizes ranging from 48 to 48,543). Tree-based ensemble models (eg, random forests and extreme gradient boosting) were the most frequently used (16/23, 70%), followed by logistic regression (13/23, 57%), and deep learning (11/23, 48%). Model discrimination varied widely (mean area under the curve [AUC] 0.82, SD 0.08, range 0.63‐0.93). Integration of multimodal data (electronic health records+imaging) was associated with higher performance. However, model validation was often inadequate; only 6 studies (6/23, 26%) performed external validation, and performance often declined (eg, AUC decreased from 0.85 to 0.63 in 1 study). Reporting was selective; AUC was commonly reported (19/23, 83%), whereas other key metrics, such as calibration (10/23, 43%) and precision (4/23, 17%), were often omitted. PROBAST assessment indicated a high risk of bias in all included studies (23/23, 100%).
Conclusions: While ML models demonstrate technical promise for predicting intraoperative bleeding, our PROBAST assessment revealed a universally high risk of bias across all included studies. This fundamental methodological limitation, coupled with a severe lack of external validation and poor transparency in reporting, severely constrains the current clinical reliability of these models. Future research must prioritize prospective multicenter validation, adherence to Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines, and enhanced model interpretability to bridge the gap toward clinical utility.
doi:10.2196/80930
Introduction
Background
Perioperative bleeding is a significant risk in surgical procedures and is strongly linked to increased patient mortality, higher rates of postoperative complications, and excessive use of health care resources []. The effectiveness of intraoperative bleeding control directly affects both surgical safety and patient outcomes []. Excessive blood loss compromises the surgical field and prolongs the duration of surgery [], while also increasing the risk of severe adverse events, such as myocardial infarction and acute kidney injury []. Although patient blood management strategies focus on optimizing preoperative risk assessment, facilitating real-time intraoperative interventions, and guiding postoperative transfusion decisions through accurate predictions of blood loss [], current clinical practice still struggles with the reliability of predictive tools.
Intraoperative blood loss, quantified as estimated blood loss, is a fundamental metric in perioperative management, providing critical evidence to guide fluid resuscitation strategies, transfusion decisions, and the prevention and control of postoperative complications. Consequently, monitoring accuracy is regarded as a quality standard for perioperative care []. However, current clinical assessment methods have 2 limitations: subjective assessment techniques (eg, visual estimation of soaked gauze or suction canister volume) depend on operator experience, resulting in high error rates, and calculation-based methods (relying on material weight differences) struggle to capture the dynamic blood loss process in real time []. Such inaccuracies can lead to erroneous transfusion decisions. Research has confirmed that inappropriate transfusion is an independent risk factor for postoperative infection and organ dysfunction []. Although existing risk assessment tools (eg, the surgical blood loss score) are widely used [,,], their inherent weaknesses, namely heterogeneous scoring criteria and a lag behind advances in surgical techniques, are becoming increasingly apparent.
Although traditional prediction models (such as logistic regression) are widely used, they are constrained by linear assumptions and fail to effectively capture complex nonlinear interactions and multicollinearity among variables. Evidence suggests that predictive accuracy based on clinical experience is significantly lower than that achieved by machine learning (ML) methods []. With the advancement of hospital information platforms, vast amounts of high-dimensional, heterogeneous clinical data have been accumulated. Owing to its advantages in processing such data and identifying nonlinear patterns [], ML has rapidly emerged as a research hotspot in intraoperative bleeding prediction. However, the existing evidence is highly fragmented. Studies predominantly concentrate on single surgical procedures (eg, cesarean section [] and spinal surgery []), resulting in a scarcity of cross-scenario algorithm comparisons; equally important, the methodological quality and validation rigor of these models are highly variable and often inadequate. Methodological limitations (such as inconsistent data preprocessing and the absence of standardized validation frameworks) have yet to be systematically evaluated. More critically, clinical translation is severely hindered by inadequate model generalizability, largely due to a pervasive lack of robust external validation. This fragmented landscape, together with unaddressed methodological concerns, impedes understanding of ML’s actual value and of the optimal implementation pathways for intraoperative bleeding prediction, and it calls for a systematic synthesis and appraisal of these methods.
Research Objective
Therefore, based on the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) framework [], this study establishes the following objectives:
- To show how ML algorithms are used to predict bleeding during surgery in different settings;
- To examine how model development and validation choices (eg, feature selection and algorithm choice) affect performance (eg, sensitivity and specificity);
- To identify the best-performing algorithms and define the criteria for judging them in specific fields; and
- To highlight key problems that slow real-world use and suggest practical steps for future research and practice.
Methods
Overview
This scoping review was conducted following the methodological framework proposed by Arksey and O’Malley [] and reported in accordance with the PRISMA-ScR guidelines [] to ensure transparency and consistency. Given the focus on prediction models, the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies (CHARMS) [] was used to guide data extraction.
Search Strategy
A systematic literature search was conducted on April 10, 2025. The search followed the Population, Intervention, Comparator, Outcome, and Study design (PICOS) framework. Both controlled vocabularies (eg, Medical Subject Headings for PubMed and Emtree for Embase) and free-text terms were used. Searches focused on three concepts: population (patients undergoing surgery), predictive tool (ML models), and outcome (risk of bleeding during surgery). In total, seven databases were searched: PubMed, Web of Science, Embase, CINAHL Complete, CNKI (China National Knowledge Infrastructure), Wanfang Data, and VIP (China Science and Technology Journal Database). The table below details the search strategy for each database. Reference lists of included studies and leading journals were also manually screened.
| Database | Hits, n | Search strategy |
| PubMed | 86 | (“Machine Learning”[Mesh] OR “Artificial Intelligence”[Mesh] OR “machine learning”[tiab] OR “deep learning”[tiab]) AND (“Surgery”[Mesh] OR “Surgical Procedures, Operative”[Mesh] OR surg[tiab] OR intraoperative[tiab]) AND (“Intraoperative Complications”[Mesh] OR “Hemorrhage”[Mesh] OR “Blood Loss, Surgical”[Mesh] OR bleed[tiab] OR “blood loss”[tiab]) |
| Web of Science | 79 | TS=((“machine learning” OR “artificial intelligence”) AND (surg* OR intraoperative) AND (bleed* OR “blood loss” OR hemorrhag*)) |
| Embase | 220 | (‘machine learning’/exp OR ‘artificial intelligence’/exp OR ‘machine learning’:ab,ti) AND (‘surgery’/exp OR ‘intraoperative period’/exp OR surg:ab,ti) AND (‘intraoperative bleeding’/exp OR ‘surgical blood loss’/exp OR bleed:ab,ti) |
| CINAHL Complete | 1709 | (MH “Machine Learning+” OR TI “machine learning” OR AB “artificial intelligence”) AND (MH “Surgery, Operative+” OR TI surg* OR AB intraoperative) AND (MH “Intraoperative Complications+” OR MH “Blood Loss, Surgical+” OR TI bleed* OR AB “blood loss”) |
| CNKI | 212 | (SU=(‘machine learning’ OR ‘deep learning’ OR ‘artificial intelligence’)) AND (SU=(‘surgery’ OR ‘intraoperative’ OR ‘surgical procedure’)) AND (SU=(‘intraoperative bleeding’ OR ‘surgical bleeding’ OR ‘blood loss’)) |
| Wanfang Data | 331 | (Subject:(“machine learning” OR “artificial intelligence”)) AND (Subject:(“surgery” OR “surgical”)) AND (Subject:(“intraoperative bleeding” OR “surgical bleeding”)) |
| VIP | 12 | (U=(‘machine learning’ OR ‘artificial intelligence’)) AND (U=(‘intraoperative bleeding’ OR ‘surgical blood loss’)) AND (M=(‘surgery’) OR T=(‘surgical patients’)) |
CNKI: China National Knowledge Infrastructure.
VIP: China Science and Technology Journal Database.
Study Selection
Initial search records were imported into EndNote X9 (Clarivate). Duplicates were removed using automated and manual deduplication. Two reviewers (SY and PZ) independently screened titles and abstracts for relevance, recording decisions separately. For records retained after screening, both reviewers assessed full-text articles for eligibility and recorded decisions independently. Assessment was blinded to ensure objectivity. Disagreements were resolved through discussion or, if needed, by consulting a third senior researcher (HH). A systematic review decision matrix (see the table below) guided the application of eligibility criteria.
| Category | Inclusion criteria | Exclusion criteria |
| Population | Adult patients (≥18 y) undergoing surgery | ― |
| Predictive tool | ML-based models explicitly developed to predict intraoperative bleeding risk | Models predicting only postoperative bleeding or failing to distinguish intraoperative or postoperative outcomes |
| Outcome reporting | Reported at least one performance metric: area under the curve (AUC), sensitivity, or specificity | ― |
| Study design | Primary research: retrospective or prospective cohort studies, case-control studies | Conference abstracts, reviews, case reports, editorials, letters |
| Publication status | Full text in Chinese or English (including peer-reviewed preprints) | Non–peer-reviewed manuscripts, publications not in Chinese or English |
| Data source | ― | Nonclinical or invalid sources: animal experiments, simulated datasets, nonhospital data |
―: Not applicable.
Eligibility Criteria
Studies that did not meet the inclusion criteria were excluded during screening. Eligibility was determined using the predefined criteria outlined in the table above. The review decision matrix applied these criteria to the full texts to determine whether studies reported outcomes of intraoperative bleeding prediction.
Data Extraction and Synthesis
Data extraction was performed independently by 2 reviewers (SY and PZ) using a standardized electronic form based on the aforementioned CHARMS checklist. The reviewers extracted data on the following: (1) study characteristics (author, year, country, design, sample size, surgical type, and data source), (2) model development (candidate and final predictors, data preprocessing, and ML algorithms), and (3) model performance and validation (validation method, performance metrics such as area under the curve [AUC], sensitivity, specificity, precision, and calibration). Any discrepancies were resolved through discussion or by consultation with a third reviewer (HH). Given methodological heterogeneity across studies, including differences in algorithms, validation strategies, and outcome reporting, a narrative synthesis was used for data analysis. The primary studies in this review reported model performance metrics (eg, AUC and sensitivity) and their CIs, not traditional hypothesis-testing P values for intergroup comparisons. Therefore, P values were neither extracted nor assessed. This approach aligns with the methodological focus of prediction model research.
Risk of Bias and Quality Assessment
The risk of bias and applicability of the included studies were assessed using the Prediction Model Risk Of Bias Assessment Tool (PROBAST) []. PROBAST covers 4 domains: participants, predictors, outcome, and analysis. Two reviewers (SY and PZ) independently assessed each study, with disagreements resolved by consensus or consultation with a third researcher (HH). The results of this assessment are summarized descriptively in the Results section.
Ethical Considerations
This study did not require ethical approval. We did not study any human or animal subjects, and we did not collect personal information or sensitive data.
Results
Search Results
The systematic search initially identified 2651 records. After removing 143 duplicates, 2508 records were screened based on titles and abstracts. Of these, 2429 records were excluded. The full texts of the remaining 79 articles were assessed for eligibility, of which 56 were excluded for reasons detailed in the study selection flow diagram. Consequently, 23 studies [,,,-] met the inclusion criteria and were included in this scoping review.
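The record counts in this paragraph can be checked with simple arithmetic. The following minimal Python sketch uses only the numbers reported above; the variable names are illustrative and do not come from the original study.

```python
# Record counts as reported in the Search Results paragraph.
identified = 2651
duplicates_removed = 143
excluded_on_title_abstract = 2429
excluded_on_full_text = 56

# Each stage is the previous stage minus the records removed at that step.
screened = identified - duplicates_removed
full_text_assessed = screened - excluded_on_title_abstract
included = full_text_assessed - excluded_on_full_text

print(f"screened={screened}, full text={full_text_assessed}, included={included}")
```

The three derived values agree with the 2508 screened records, 79 full-text articles, and 23 included studies stated in the text.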

Characteristics of Included Studies
The detailed characteristics of the 23 included studies [,,,-] are presented in the table below. The sample sizes varied widely, ranging from 48 to 48,543 cases. All studies were retrospective in design, with 17 (74%) [,,,-,-] being single-center investigations. The publication years were concentrated between 2019 and 2025, and the geographical distribution was highly skewed, with studies from China dominating (17/23, 74% [,,-,,,,-]). The main surgical contexts were obstetric procedures (10/23, 43% [,,,,,,,,,]), orthopedic surgery (4/23, 17% [,,,]), and hepato-biliary surgery (4/23, 17% [,,,]). Considerable heterogeneity was observed in the definitions of intraoperative major bleeding across studies, ranging from ≥200 mL to >5000 mL.
| Author, year | Country | Study design | Surgical type (Specific procedure) | Sample size (Development/Validation) | Data source | EBL definition |
| Akazawa and Hashimoto [], 2023 | Japan | Single-Center Retrospective Cohort Study | Obstetric (Cesarean section) | 48 | MRI + EMR | ≥2000 mL |
| Akazawa and Hashimoto [], 2024 | Japan | Multi-Center Retrospective Cohort Study | Obstetric (Cesarean section) | 63 (50/13) | MRI + EMR | >2000 mL |
| Chen et al [], 2024 | China | Multi-Center Retrospective Cohort Study | Obstetric (Cesarean section) | 1975 (1680/295) | EMR | ≥300 mL |
| de Reus et al [], 2025 | United States, Netherlands, and United Kingdom | Multi-Center Retrospective Cohort Study | Orthopedic (Spinal decompression) | 880 | EMR | >2500 mL |
| Li et al [], 2024 | China | Single-Center Retrospective Study | Hepatic (Tumor resection) | 406 (284/122) | EMR | ≥1000 mL |
| Liu et al [], 2020 | China | Single-Center Retrospective Study | Obstetric (Cesarean section) | 210 | MRI | ≥500 mL |
| Mo et al [], 2023 | China | Multi-Center Retrospective Study | Gynecological (Hysteroscopic surgery) | 200 (120/80) | EMR | ≥200 mL |
| Park et al [], 2022 | South Korea | Single-Center Retrospective Study | Hepatic (Transplantation) | 414 | EMR | ≥5000 mL |
| Shi et al [], 2024 | China | Multi-Center Observational Cohort Study | Orthopedic (Spinal decompression) | 276 (200/76) | EMR | ≥2500 mL |
| Shi et al [], 2023 | China | Single-Center Retrospective Cohort Study | Multi-departmental surgeries | 48,543 | EMR | >200 mL |
| Stehrer et al [], 2019 | Austria | Single-Center Retrospective Study | Craniofacial (Orthognathic surgery) | 950 (760/190) | EMR | Calculated using hemoglobin balance method |
| Sun et al [], 2025 | China | Single-Center Retrospective Study | Orthopedic (Lumbar fusion) | 2054 (1437/617) | EMR | ≥500 mL |
| Wang [], 2023 | China | Single-Center Retrospective Study | Obstetric (Cesarean section) | 168 (117/51) | EMR | >1000 mL |
| Wakiya et al [], 2021 | Japan | Single-Center Retrospective Cohort Study | General (Pancreatic cancer resection) | 175 (128/47) | EMR | >20% of circulating blood volume |
| Xu [], 2024 | China | Single-Center Retrospective Study | Obstetric (Cesarean section) | 249 (149/50/50) | MRI + EMR | ≥1000 mL |
| Xue et al [], 2021 | China | Single-Center Retrospective Study | Hepatic (Tumor resection) | 665 (466/199) | EMR | ≥800 mL |
| Yang et al [], 2022 | China | Single-Center Retrospective Study | Orthopedic (Spinal fracture) | 161 | EMR | Hidden blood loss (no explicit quantitative threshold) |
| Yang et al [], 2023 | China | Multi-Center Retrospective Cohort Study | Obstetric (Cesarean section) | 125 (85/40) | MRI + EMR | ≥1500 mL |
| Yin et al [], 2021 | China | Single-Center Retrospective Study | Oncological (Pelvic/sacral tumors) | 810 | CT + EMR | >3000 mL |
| Zheng et al [], 2024 | China | Single-Center Retrospective Study | Obstetric (Cesarean section) | 346 (156/68/122) | MRI + Coagulation tests + EMR | >1000 mL |
| Zheng et al [], 2022 | China | Single-Center Retrospective Study | Hepatic (Tumor resection) | 336 (268/68) | EMR | ≥300 mL |
| Zong et al [], 2024 | China | Single-Center Retrospective Cohort Study | Obstetric (Cesarean section) | 323 (227/96) | MRI + EMR | ≥1500 mL |
| Li [], 2024 | China | Single-Center Retrospective Case-Control Study | Obstetric (Cesarean section) | 231 | EMR | ≥1500 mL |
EBL: estimated blood loss.
MRI: magnetic resonance imaging.
EMR: electronic medical record.
CT: computed tomography.
Technical Features and Performance of Prediction Models
All models were based on electronic health records (EHRs). A total of 8 studies (35%) [,,,,-,] further integrated medical imaging data, including magnetic resonance imaging (MRI) or computed tomography, of which 7 (30%) [,,,,,,] focused on predicting obstetric bleeding. In terms of algorithms, tree-based ensemble models were most frequently applied (12/23, 52% [,,,-,,-,,]), especially random forests (8/23, 35% [,,,,,,,]) and extreme gradient boosting (9/23, 39% [,,,,,,,]); logistic regression (13/23, 57% [,,,,,,,,-,]) and deep learning (6/23, 26% [,,,,,]) models were also commonly used. Model discrimination performance is summarized in the table below. The AUC ranged from 0.63 to 0.93, with a mean of 0.82 (SD 0.08). Models incorporating multimodal data (eg, EHR+imaging) showed a performance advantage (mean AUC≈0.84, SD 0.075) over unimodal models relying solely on EHR (mean AUC≈0.82, SD 0.069). For instance, the support vector machine model by Xu [], which fused MRI radiomic features with clinical data, achieved an AUC of 0.87.
| Author | Predictors categories | ML algorithms | Best model | Internal validation (test set performance) | External validation performance | Validation methods |
| Akazawa and Hashimoto [] | MRI, laboratory parameters, demographic characteristics | Multimodal DL, XGBoost, VGG16 | Multimodal DL | AUC=0.73 (95% CI 0.66‐0.80), Accuracy=0.68 | Not reported | Random split (8:2), cross-validation |
| Akazawa and Hashimoto [] | Radiomics features, clinical variables | LR | LR | AUC=0.69 (95% CI 0.62‐0.75) | AUC=0.70 (95% CI 0.66‐0.73) | Internal: random split (7:3), external: another institution |
| Chen et al [] | Clinical variables | Bayes, MLP, DT, KNN, LR, RF, SVM, XGBoost | Bayes | AUC=0.82 (95% CI 0.80‐0.84), Sensitivity=0.93, Specificity=0.42, F score=0.92 | AUC=0.85 (95% CI 0.83‐0.87), Sensitivity=0.95, Specificity=0.50, F score=0.96 | Internal validation: 10-fold cross-validation, (8:2 split), multicenter external validation |
| de Reus et al [] | Tumor type, ECOG score, surgical procedure, preoperative platelet count | Not reported | Not reported | Not reported | AUC=0.63 (95% CI 0.58‐0.68), Sensitivity=0.74, Specificity=0.41, F score=0.33 | Multicenter external validation |
| Li et al [] | Demographic characteristics, laboratory parameters, imaging characteristics, pathological characteristics | LR | LR | AUC=0.80 | Not reported | Random split (training set:test set=7:3) |
| Liu et al [] | MRI | DL | VGG16 | Accuracy=0.75, Sensitivity=0.73, Specificity=0.77 | Not reported | 5-fold cross-validation |
| Mo et al [] | Clinical variables | DNN | DNN | Accuracy=0.91, Sensitivity=0.89, Specificity=0.92, Precision=0.92 | Not reported | Training:test=6:4 |
| Park et al [] | Laboratory parameters, surgical parameters, MELD score, demographic characteristics | LR, Elastic Net, SVM, RF, XGBoost, NN | LR | AUROC=0.84, AUPR=0.82 | Not reported | Training:test=7:3, feature selection via nested cross-validation |
| Shi et al [] | Tumor type, ECOG score, surgical procedure, preoperative platelet count | LR, KNN, DT, XGBoost, RF, SVM | XGBoost | AUC=0.85 (95% CI 0.82‐0.87), Accuracy=0.77, Recall=0.85, F score=0.78, Precision=0.72 | AUC=0.80 (95% CI 0.77‐0.86), Accuracy=0.73, Recall=0.73, F score=0.73, Precision=0.73 | Internal validation: random split (7:3 ratio), external validation: independent cohort |
| Shi et al [] | Surgical parameters, laboratory parameters, demographic characteristics | LGB, XGBoost, CatB, AdaB, LR, LSTM, MLP | LGB | AUC=0.93, Accuracy=0.87, Sensitivity=0.8, Specificity=0.85 | Not reported | Training:test=2:1, ADASYN was used to address data imbalance |
| Stehrer et al [] | Surgical parameters, laboratory parameters, demographic characteristics | RF | RF | Regression performance: significant correlation between predicted and actual values; mean error 7.4 (SD 172.3) mL | Not reported | Random split (training:test=8:2), performance evaluation: correlation and mean error between predicted and actual values |
| Sun et al [] | Surgical parameters, laboratory parameters, demographic characteristics | LR | LR | AUC=0.73 (95% CI 0.67‐0.79), Accuracy=0.88 | Not reported | Random split (training set:test set=7:3), 5-fold cross-validation |
| Wang [] | Radiomics features, clinical variables | LR, SVM, RF, SGD, KNN | LR | AUC=0.83, Accuracy=0.80, Sensitivity=0.75, Specificity=0.83 | Not reported | Random split (training set:test set=7:3), 5-fold cross-validation |
| Wakiya et al [] | Surgical parameters, laboratory parameters, tumor markers | DT | DT | Accuracy=0.80, Sensitivity=1, Specificity=0.66 | Not reported | Random split (training set:test set=3:1) |
| Xu [] | Radiomics features, clinical features | SVM | SVM | AUC=0.87, Accuracy=0.85, Sensitivity=0.72, Specificity=0.89 | Not reported | Random split: training:validation:test=6:2:2 |
| Xue et al [] | Laboratory parameters | LR, DT, XGBoost, CNN, LSTM | XGBoost | AUC=0.72, Accuracy=0.87, Precision=1, Recall=0.18, F score=0.31 | Not reported | Random split (training set:test set=7:3), 5-fold cross-validation |
| Yang et al [] | Demographic characteristics, surgical parameters, laboratory parameters | XGBoost, LR, LGBM, RF, SVM | RF | AUC=0.86, Accuracy=0.78, Sensitivity=0.86, Specificity=0.81 | Not reported | Random split into training and internal validation sets; 15-fold cross-validation conducted on the training set |
| Yang et al [] | MRI-anatomical-clinical features, morphological features | LR, SVM, RF, XGBoost | XGBoost | AUROC=0.88 (95% CI 0.74‐1.00), Accuracy=0.85, Sensitivity=0.90, Specificity=0.81 | AUROC=0.82 (95% CI 0.68‐0.96), Accuracy=0.78, Sensitivity=0.81, Specificity=0.75 | Data from 2 medical centers |
| Yin et al [] | CT-based radiomics features, clinical factors | DNN, LR | DNN | AUC=0.92, Accuracy=0.75, Sensitivity=0.30, Specificity=0.83 | Not reported | Random split (training set:test set=7:3), temporal split, class imbalance handling: SMOTE |
| Zheng et al [] | Radiomics features, clinical factors, laboratory parameters | SVM | SVM | AUC=0.87 (95% CI 0.76‐0.94), Accuracy=0.76, Sensitivity=1, Specificity=0.65 | AUC=0.81 (95% CI 0.72‐0.87), Accuracy=0.79, Sensitivity=0.87, Specificity=0.65 | Center 1: partitioned into training and internal test sets. Center 2: designated as the external test set. |
| Zheng et al [] | Tumor characteristics, surgical parameters, laboratory parameters | RF, MDN | RF | AUC=0.79 (95% CI 0.65‐0.93), Accuracy=0.82 | Not reported | Random split (training set:test set=8:2), bootstrap |
| Zong et al [] | Multiparametric MRI | DL | MS-3D-ResNet | AUC=0.87 (95% CI 0.86‐0.89), Accuracy=0.85, Sensitivity=0.86, Specificity=0.85 | Not reported | Random split (training set:test set=7:3) |
| Li [] | Clinical risk factors in obstetrics | LR, DT, KNN, BPNN, XGBoost, LGBM | LR | AUC=0.88 (95% CI 0.83‐0.92), Accuracy=0.77, Sensitivity=0.84, Specificity=0.67, PPV=0.78, NPV=0.75 | Not reported | 5-fold cross-validation |
ML: machine learning.
MRI: magnetic resonance imaging.
DL: deep learning.
XGBoost: extreme gradient boosting.
VGG16: Visual Geometry Group 16-layer network.
AUC: area under the curve.
LR: logistic regression.
Bayes: naïve Bayes.
MLP: multilayer perceptron.
DT: decision tree.
KNN: k-nearest neighbors.
RF: random forest.
SVM: support vector machine.
ECOG: Eastern Cooperative Oncology Group.
DNN: deep neural network.
MELD: Model for End-Stage Liver Disease.
NN: neural network.
AUROC: area under the receiver operating characteristic curve.
AUPR: area under the precision-recall curve.
LGB: light gradient boosting machine (LightGBM).
CatB: categorical boosting (CatBoost).
AdaB: adaptive boosting (AdaBoost).
LSTM: long short-term memory.
ADASYN: adaptive synthetic sampling.
SGD: stochastic gradient descent.
CNN: convolutional neural network.
SMOTE: synthetic minority oversampling technique.
CT: computed tomography.
MDN: mixture density network.
MS-3D-ResNet: multi-stream 3D residual network.
BPNN: back propagation neural network.
PPV: positive predictive value.
NPV: negative predictive value.
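The pooled discrimination figures quoted in this subsection (mean AUC 0.82, SD 0.08, range 0.63‐0.93) can be approximately re-derived from the table above. The sketch below is an illustration only, and it rests on one assumption: it takes a single AUC or AUROC per study, using the internal test-set value where one is reported and the external value for de Reus et al, which reported no internal estimate. The authors' exact pooling rule is not stated in the text.

```python
from statistics import mean, stdev

# One AUC/AUROC per study, transcribed in table order.
# Studies that reported no AUC (accuracy or regression error only) are omitted.
# Assumption: internal test-set value where available; external value for de Reus et al.
aucs = [
    0.73, 0.69, 0.82, 0.63, 0.80, 0.84, 0.85, 0.93, 0.73, 0.83,
    0.87, 0.72, 0.86, 0.88, 0.92, 0.87, 0.79, 0.87, 0.88,
]

n_reporting = len(aucs)   # number of studies with an AUC in the table
mean_auc = mean(aucs)
sd_auc = stdev(aucs)      # sample standard deviation (n - 1 denominator)

print(n_reporting, round(mean_auc, 2), round(sd_auc, 2), min(aucs), max(aucs))
```

Under this assumption the 19 values give a mean of about 0.82 and an SD of about 0.08, with a minimum of 0.63 and a maximum of 0.93, in line with the figures reported in the text.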
Model Validation Strategies
Although internal validation was widely applied (22/23, 96% [,,,,-]), its methodological rigor was insufficient. Only about half of the studies (12/23, 52% [,,,,,-,]) established an independent test set to evaluate final performance; even fewer used cross-validation (9/23, 39% [,,,,,,,,]). External validation was notably lacking, implemented in only 6 studies (26%) [,,,,,]. Critically, among the limited external validations, model performance generally declined. For example, the model by Shi et al [] dropped from an internal AUC of 0.85 to an external AUC of 0.80; when de Reus et al [] independently validated the same model in a multinational, multicenter setting, the AUC further decreased to 0.63.
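The size of the internal-to-external change can be tabulated directly from the performance table. The following sketch is a simple illustration using the AUC pairs reported there for the 5 studies that give both an internal and an external estimate; the labels and the dictionary layout are illustrative choices, and a negative value means the external AUC was lower.

```python
# (study label, internal AUC, external AUC) as reported in the performance table.
paired_aucs = [
    ("Akazawa and Hashimoto", 0.69, 0.70),
    ("Chen et al", 0.82, 0.85),
    ("Shi et al", 0.85, 0.80),
    ("Yang et al", 0.88, 0.82),
    ("Zheng et al", 0.87, 0.81),
]

# External minus internal AUC, rounded to the 2 decimals used in the table.
changes = {label: round(ext - internal, 2) for label, internal, ext in paired_aucs}

# Shi et al model revalidated by de Reus et al in a multinational cohort.
shi_to_de_reus = round(0.63 - 0.85, 2)

for label, delta in changes.items():
    print(f"{label}: {delta:+.2f}")
print(f"Shi et al model in de Reus et al cohort: {shi_to_de_reus:+.2f}")
```

Listing the changes side by side makes it easier to see which models lost discrimination on independent data and by how much, including the 0.22 drop for the model revalidated by de Reus et al.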
Completeness of Performance Metric Reporting
Key performance metrics were reported selectively. The discrimination metric AUC was reported most frequently (19/23, 83% [,,,-,,,,,-]), whereas reporting of essential classification metrics was incomplete: sensitivity (16/23, 70% [,,,,,,-,,]) and specificity (14/23, 61% [,,,,,-,-,,]). Reporting rates for precision (4/23, 17% [,,,]) and F1-score (4/23, 17% [,,,]) were very low. Furthermore, only 10 of the 23 studies (43%) [,,,,,,-,] reported model calibration (eg, calibration curves).
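To illustrate why precision and recall matter alongside AUC and accuracy, the sketch below recomputes the F score (the harmonic mean of precision and recall) from values reported in the performance table. The function name is illustrative; the inputs are the precision and recall reported for Xue et al and for the internal test set of Shi et al.

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values as reported in the performance table.
xue_f = f_score(precision=1.0, recall=0.18)    # Xue et al
shi_f = f_score(precision=0.72, recall=0.85)   # Shi et al, internal test set

print(round(xue_f, 2), round(shi_f, 2))
```

The recomputed values round to 0.31 and 0.78, matching the table. The Xue et al example shows how an accuracy of 0.87 can coexist with a recall of 0.18, a pattern that would be invisible in a report limited to AUC and accuracy.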
Data Preprocessing and Interpretability
Reporting of data-preprocessing pipelines was seriously deficient (see the table below). In total, 11 studies (48%) [,,,,-,,] did not describe any method for handling missing data. Only 3 studies (13%) [,,] reported strategies to address class imbalance (eg, using the synthetic minority oversampling technique [SMOTE]). The vast majority of studies neither applied nor reported any model interpretability analyses (eg, Shapley Additive Explanations [SHAP] and local interpretable model-agnostic explanations), rendering the models essentially “black-box.”
| Author | Missing data handling | Class imbalance handling | Data normalization or standardization |
| Akazawa and Hashimoto [] | Not reported | Not reported | Not reported |
| Akazawa and Hashimoto [] | Exclusion of cases with missing data | Not reported | Standardization of all radiomic features |
| Chen et al [] | Multiple imputation using MICE package | Not reported | Standardization: numerical variables were standardized |
| de Reus et al [] | Multiple imputation combined with exclusion | Not reported | Not reported |
| Li et al [] | Not reported | Not reported | Not reported |
| Liu et al [] | Not reported | Not reported | Not reported |
| Mo et al [] | Missing values were filled with 0 | Not reported | Not reported |
| Park et al [] | Not reported | Not reported | Not reported |
| Shi et al [] | Median imputation | SMOTE-Tomek | Not reported |
| Shi et al [] | KNN imputation | ADASYN | Not reported |
| Stehrer et al [] | Exclusion if >25% missing; mean or mode imputation if <25% | Not reported | Not reported |
| Sun et al [] | Exclusion of patients with missing key indicators | Not reported | Not reported |
| Wang [] | Not reported | Not reported | Z-score normalization |
| Wakiya et al [] | Not reported | Not reported | Not reported |
| Xu [] | Exclusion of patients with missing key indicators | Not reported | MRI pixel values scaled to [0,1] |
| Xue et al [] | Not reported | Not reported | Not reported |
| Yang et al [] | Not reported | Not reported | Not reported |
| Yang et al [] | Not reported | Not reported | Not reported |
| Yin et al [] | Not reported | SMOTE | Not reported |
| Zheng et al [] | Not reported | Not reported | Not reported |
| Zheng et al [] | Not reported | Not reported | Standardization and normalization applied |
| Zong et al [] | Not reported | Not reported | Not reported |
| Li [] | Not reported | Not reported | Not reported |
MICE: multivariate imputation by chained equations.
SMOTE: synthetic minority oversampling technique.
KNN: k-nearest neighbors.
ADASYN: adaptive synthetic sampling.
MRI: magnetic resonance imaging.
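The practical consequence of leaving class imbalance unaddressed can be shown with a small worked example. The sketch below uses an entirely hypothetical cohort (1000 operations with a 10% rate of major bleeding) and a degenerate classifier that never predicts bleeding; neither the cohort nor the function name comes from any included study.

```python
def screening_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels; 1 = major bleeding."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity

# Hypothetical cohort: 100 major-bleeding cases among 1000 operations.
y_true = [1] * 100 + [0] * 900
# Degenerate "model" that always predicts no major bleeding.
y_pred = [0] * 1000

accuracy, sensitivity, specificity = screening_metrics(y_true, y_pred)
print(accuracy, sensitivity, specificity)
```

In this hypothetical case the accuracy is 0.9 even though no bleeding event is detected (sensitivity 0.0), which is why studies that neither handle nor report class imbalance are difficult to interpret from accuracy alone.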
Risk-of-Bias Assessment of Included Studies
Based on a systematic evaluation using PROBAST (see the table below), all included studies (23/23, 100% [,,,-]) were judged to have an overall “high” risk of bias. High risk primarily stemmed from 2 domains: the “participants” domain (23/23, 100% [,,,-], due to selection bias inherent in retrospective designs) and the “analysis” domain (20/23, 87% [,,,-,-], mainly attributable to inconsistent data preprocessing and shortcomings in validation strategies).
| Author | Participants | Predictors | Outcome | Analysis | Overall |
| Akazawa and Hashimoto [] | High | Low | Unclear | High | High |
| Akazawa and Hashimoto [] | High | Low | Unclear | High | High |
| Chen et al [] | High | Unclear | Unclear | Low | High |
| de Reus et al [] | High | Unclear | Unclear | Low | High |
| Li et al [] | High | Low | Low | High | High |
| Liu et al [] | High | Unclear | Unclear | High | High |
| Mo et al [] | High | Unclear | Low | High | High |
| Park et al [] | High | Low | Low | High | High |
| Shi et al [] | High | Low | Low | High | High |
| Shi et al [] | High | Unclear | Unclear | High | High |
| Stehrer et al [] | High | Low | High | High | High |
| Sun et al [] | High | Low | Low | Low | High |
| Wang [] | High | High | Low | High | High |
| Wakiya et al [] | High | High | High | High | High |
| Xu [] | High | High | High | High | High |
| Xue et al [] | High | High | High | High | High |
| Yang et al [] | High | High | High | High | High |
| Yang et al [] | High | Low | High | High | High |
| Yin et al [] | High | Unclear | High | High | High |
| Zheng et al [] | High | Low | Low | High | High |
| Zheng et al [] | High | Low | Low | High | High |
| Zong et al [] | High | Low | Low | High | High |
| Li [] | High | High | High | High | High |
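The domain tallies reported for this table can be re-derived mechanically. The following is a minimal Python sketch, not part of the original analysis, that transcribes the ratings above (H=high, L=low, U=unclear) and applies the PROBAST overall rule as commonly stated: overall high if any domain is high, low only if all domains are low, and unclear otherwise. The 4-letter encoding and all function names are illustrative assumptions.

```python
from collections import Counter

DOMAINS = ("participants", "predictors", "outcome", "analysis")

# One 4-letter code per study, in table order.
# Letter order: participants, predictors, outcome, analysis.
RATINGS = [
    "HLUH", "HLUH", "HUUL", "HUUL", "HLLH", "HUUH", "HULH", "HLLH",
    "HLLH", "HUUH", "HLHH", "HLLL", "HHLH", "HHHH", "HHHH", "HHHH",
    "HHHH", "HLHH", "HUHH", "HLLH", "HLLH", "HLLH", "HHHH",
]


def overall_risk(code: str) -> str:
    """Overall judgment: high if any domain is high, low if all are low, else unclear."""
    if "H" in code:
        return "H"
    if set(code) == {"L"}:
        return "L"
    return "U"


def high_counts(ratings):
    """Count, per domain, how many studies were rated high."""
    counts = Counter()
    for code in ratings:
        for domain, rating in zip(DOMAINS, code):
            if rating == "H":
                counts[domain] += 1
    return {domain: counts[domain] for domain in DOMAINS}


n = len(RATINGS)
high = high_counts(RATINGS)
overall_high = sum(overall_risk(code) == "H" for code in RATINGS)

for domain in DOMAINS:
    print(f"{domain}: {high[domain]}/{n} ({100 * high[domain] / n:.0f}%) high")
print(f"overall: {overall_high}/{n} high")
```

Under this transcription, the tally gives 23/23 high ratings for the participants domain and 20/23 (87%) for the analysis domain, and every study is rated high overall.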
Discussion
Principal Findings
This scoping review synthesizes the current state of ML for predicting intraoperative bleeding in patients undergoing surgery. The results indicate that ML models demonstrate good discriminative ability (mean AUC 0.82, SD 0.08) and, in some scenarios, outperform traditional methods []. Multimodal data (eg, EHR combined with medical imaging) can further enhance predictive efficacy, aligning with the paradigm shift from “unimodal perception” to “multimodal cognition” []. However, the PROBAST assessment reveals a fundamental contradiction: despite significant technical potential, current studies exhibit a universally high risk of bias, particularly in the analysis domain (20/23, 87% of the included studies [,,,-,-]). This raises serious concerns that the reported performance metrics are overestimated. Specifically, this systematic risk of overestimation stems from three interconnected methodological shortcomings: (1) selective reporting and optimization bias, whereby studies tend to report only the best-performing models and favorable metrics (eg, AUC) while omitting critical measures such as calibration; (2) inadequate internal validation strategies, characterized by reliance on simple data splitting without temporal validation, which may lead to overfitting and overly optimistic performance estimates; and (3) insufficient handling of critical data issues, such as class imbalance and missing data, which can artificially inflate discrimination metrics. Collectively, these flaws indicate that the reported mean AUC of 0.82 (SD 0.08) likely reflects optimal performance under ideal development conditions rather than the true generalizability of the models to independent, prospectively collected clinical data. This view is corroborated by the performance degradation commonly observed in the limited external validations available, where models often exhibit significant drops in AUC when applied to independent cohorts [,].
The remainder of this discussion therefore addresses three core aspects in turn: the completeness of model performance reporting, the rigor of validation strategies, and the transparency of data preprocessing and interpretability.
First, there is severe selective bias in the reporting of model performance, which limits a comprehensive assessment of clinical applicability. Current research focuses heavily on discrimination metrics (AUC reported in 19/23, 83% of studies [,,,-,-]), while neglecting calibration (reported in 10/23, 43% [,,,,,,-,]) and key classification metrics (eg, precision and F1-score, reported in 4/23, 17% [,,,]). This bias obscures two core issues. The first issue is that the widespread absence of calibration assessment undermines the clinical credibility of predicted probabilities. Calibration reflects the agreement between predicted probabilities and observed risks and serves as the direct basis for risk stratification []. However, only a minority of studies reported calibration results [,,,,,,-,]. More critically, calibration performance is unstable and cannot be inferred from a high AUC. For example, one study [] reported good internal calibration, whereas independent external validation [] revealed significant miscalibration. Calibration must therefore be evaluated independently, as its problems are often exposed only during external validation, and its absence in most studies casts doubt on the reliability of their “risk probability” outputs. The second issue is that incomplete reporting of key classification metrics hinders the judgment of model utility. Precision is crucial for assessing alert efficiency and preventing alert fatigue, yet it is rarely reported [,,,], which makes it impossible to quantify a model’s false-positive burden. For instance, one model [] reported high sensitivity (ie, it identified most true bleeding events) but lower specificity, implying a larger number of false-positive alerts. Without precision, the proportion of alerts that are correct cannot be quantified, making it difficult to assess whether this high-sensitivity strategy would lead to alert fatigue in practice. Conversely, the model developed by Xue et al [] achieved high precision (ie, most of its alerts were true), but its sensitivity might be low, potentially missing a considerable proportion of true bleeding events, which could increase the risk of clinical underdiagnosis. The systematic absence of these key metrics makes it challenging to evaluate model robustness across different clinical decision thresholds. Therefore, future research must strictly adhere to reporting guidelines, such as Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) [], and comprehensively present calibration and classification metrics to bridge the gap between technical development and clinical practice.
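To make the distinction between discrimination, calibration, and threshold-dependent metrics concrete, the following is a minimal Python sketch using only the standard library. The 10-patient cohort and its predicted risks are hypothetical and are not taken from any included study. The function names are illustrative, and the square-root rescaling is chosen only to show that a monotone distortion of the predicted risks leaves the AUC unchanged while worsening calibration.

```python
from math import sqrt
from typing import Dict, List


def auc(y: List[int], p: List[float]) -> float:
    """Chance that a random event outranks a random non-event (ties count 0.5)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(y: List[int], p: List[float]) -> float:
    """Mean predicted risk minus observed event rate (0 means no systematic bias)."""
    return sum(p) / len(p) - sum(y) / len(y)


def brier_score(y: List[int], p: List[float]) -> float:
    """Mean squared difference between predicted risk and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def threshold_metrics(y: List[int], p: List[float], threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity, and precision when alerting at p >= threshold."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= threshold)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= threshold)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi < threshold)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < threshold)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
    }


# Hypothetical cohort: 10 patients, 3 major bleeding events.
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]
p_inflated = [sqrt(pi) for pi in p]  # same ranking, systematically higher risks

for name, preds in (("original", p), ("inflated", p_inflated)):
    print(name, "AUC:", round(auc(y, preds), 3),
          "calibration-in-the-large:", round(calibration_in_the_large(y, preds), 3),
          "Brier:", round(brier_score(y, preds), 3))

for threshold in (0.25, 0.5):
    print(threshold, threshold_metrics(y, p, threshold))
```

In this toy cohort, lowering the alert threshold from 0.5 to 0.25 raises sensitivity to 1.0 but halves precision to 0.5, which is the trade-off that cannot be judged when only sensitivity or only AUC is reported.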
Second, model validation strategies generally lack rigor, and the widespread absence of external validation in particular weakens any assessment of generalizability. This review found that although over half of the studies (12/23, 52% [,,,,,-,]) established an independent test set, internal validation mostly relied on simple data splitting, with only one study [] using the more robust temporal validation method. This overreliance on simple hold-out methods, coupled with limited adoption of resampling methods such as cross-validation, may lead to optimistic performance estimates. More critically, external validation was rare (6/23, 26% [,,,,,]), and performance degradation was commonly observed where it was performed, which directly reveals the limited generalizability of models developed on homogeneous data. For example, the model by Shi et al [] showed a decrease in AUC from 0.85 in internal validation to 0.63 in multinational, multicenter external validation []. Models by Yang et al [] and Zheng et al [] showed similar declines in external performance. A notable exception is the model by Chen et al [], which was built on large-scale multicenter data and showed improved performance in external validation, suggesting that an appropriate study design can enhance generalizability. In summary, the generalizability of existing models has not been sufficiently or rigorously validated. To confirm the effectiveness and broad applicability of models in real-world settings, future research must incorporate prospective design, temporal validation, and multicenter external validation as key components of model evaluation.
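The difference between a simple random hold-out and a temporal split can be sketched as follows. This is a minimal, standard-library Python illustration under stated assumptions: the `Case` record, the 24-case cohort with monthly surgery dates, and the July 2021 cutoff are all hypothetical and do not reproduce the procedure of any included study.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Case:
    case_id: int
    surgery_date: date
    major_bleed: int  # 1 = major intraoperative bleeding, 0 = none


def random_split(cases: List[Case], test_fraction: float = 0.3,
                 seed: int = 0) -> Tuple[List[Case], List[Case]]:
    """Simple hold-out: shuffle and cut, ignoring when each surgery took place."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(cases: List[Case], cutoff: date) -> Tuple[List[Case], List[Case]]:
    """Temporal validation: develop on earlier cases, evaluate on later ones."""
    train = [c for c in cases if c.surgery_date < cutoff]
    test = [c for c in cases if c.surgery_date >= cutoff]
    return train, test


def train_overlaps_test_period(train: List[Case], test: List[Case]) -> bool:
    """True if any training case is dated on or after the earliest test case."""
    earliest_test = min(c.surgery_date for c in test)
    return any(c.surgery_date >= earliest_test for c in train)


# Hypothetical cohort: one case per month from January 2020 to December 2021.
cohort = [Case(i, date(2020 + i // 12, i % 12 + 1, 1), int(i % 4 == 0))
          for i in range(24)]

train_r, test_r = random_split(cohort)
train_t, test_t = temporal_split(cohort, cutoff=date(2021, 7, 1))

print("random split sizes:", len(train_r), len(test_r),
      "| overlap in time:", train_overlaps_test_period(train_r, test_r))
print("temporal split sizes:", len(train_t), len(test_t),
      "| overlap in time:", train_overlaps_test_period(train_t, test_t))
```

A temporal split guarantees that the evaluation cases come from a later period than the development cases, so it is sensitive to drift in case mix or practice that a random split tends to average away. It remains a form of internal validation and does not replace external validation at other centers.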
Third, insufficient transparency in data preprocessing and the widespread lack of model interpretability constitute another systemic methodological defect hindering reproducibility and clinical translation. This review found that nearly half of the studies (11/23, 48% [,,,,-,,]) did not report methods for handling missing data, and only 3 (3/23, 13% [,,]) addressed class imbalance. The reporting of data preprocessing steps is severely deficient and nonstandard (eg, failing to clearly describe key procedures such as handling missing values and normalization [,-,,,-]), which directly compromises model robustness and reproducibility. Although a few studies adopted more rigorous methods (eg, multiple imputation [,], SMOTE [], or adaptive synthetic sampling []), simpler strategies that may introduce bias (eg, direct case deletion [,,]) remain common. This lack of transparency makes interstudy comparison and independent replication exceptionally difficult and may partly explain the performance decline observed for some models during external validation []. Concurrently, model interpretability analysis is far from standard practice. The vast majority of studies lack any explanatory analysis, rendering them “black boxes” that clinical decision-makers find difficult to trust. Only a few studies applied interpretability techniques, such as SHAP values or feature importance rankings [,], to identify key risk features and enhance transparency. Therefore, future research must promote the standardized reporting of data preprocessing workflows and integrate interpretability analysis throughout model development and validation, which is a key prerequisite for building trustworthy and clinically usable prediction tools.
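One way to make preprocessing both reproducible and free of information leakage is to estimate every preprocessing parameter on the training data only and to report those parameters. The following minimal Python sketch illustrates the idea with the standard library. It deliberately uses simple stand-ins (mean imputation, z-score standardization, and inverse-frequency class weights) rather than the multiple imputation or SMOTE methods named above, and the hemoglobin values, labels, and function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def fit_preprocessor(train_values: List[Optional[float]]) -> Dict[str, float]:
    """Estimate imputation and scaling parameters from the training data only."""
    observed = [v for v in train_values if v is not None]
    center = mean(observed)
    spread = pstdev(observed) or 1.0  # avoid division by zero for constant columns
    return {
        "impute_value": center,
        "mean": center,
        "sd": spread,
        "n_missing_train": len(train_values) - len(observed),
    }


def apply_preprocessor(values: List[Optional[float]],
                       params: Dict[str, float]) -> List[float]:
    """Impute and standardize any data set with the stored training parameters."""
    filled = [params["impute_value"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["sd"] for v in filled]


def balanced_class_weights(labels: List[int]) -> Dict[int, float]:
    """Inverse-frequency weights, n / (n_classes * n_in_class), for imbalanced outcomes."""
    classes = sorted(set(labels))
    return {c: len(labels) / (len(classes) * labels.count(c)) for c in classes}


# Hypothetical preoperative hemoglobin values (g/L); None marks a missing value.
train_hb = [120.0, None, 140.0, 130.0, None, 110.0]
test_hb = [None, 125.0, 150.0]
train_labels = [1, 0, 0, 0, 0, 0]  # 1 = major bleeding

params = fit_preprocessor(train_hb)        # fitted once, on training data
train_scaled = apply_preprocessor(train_hb, params)
test_scaled = apply_preprocessor(test_hb, params)  # reuses training parameters
weights = balanced_class_weights(train_labels)

# The fitted parameters double as a reportable preprocessing record.
print("preprocessing record:", params)
print("class weights:", weights)
print("scaled test values:", [round(v, 2) for v in test_scaled])
```

Printing the fitted parameters alongside the results gives readers the information that most included studies omitted: how many values were missing, what replaced them, and how features were scaled.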
Future Research Directions
Based on the findings of this review, future research should focus on four core directions to move prediction models from “technically feasible” to “clinically applicable.” First, promote rigorous validation and generalizability assessment. Model development must move beyond retrospective single-center designs, collect data through multicenter prospective studies, and use temporal validation and independent external validation as cornerstones of evaluation to rigorously test robustness. Second, improve performance reporting and clinical utility evaluation. Research must strictly adhere to reporting guidelines, such as TRIPOD, and fully present performance metrics. Furthermore, methods such as decision curve analysis should be adopted to quantify the clinical net benefit of models across different decision thresholds, aligning evaluation with real-world decision-making scenarios. Third, standardize data processing and enhance model interpretability. Detailed reporting of data preprocessing workflows, along with the adoption of advanced methods for handling missing values and class imbalance, should become standard practice. Simultaneously, interpretability techniques, such as SHAP, should be integrated into the development pipeline as essential components to elucidate risk mechanisms and build clinical trust. Fourth, explore clinical integration pathways and evaluate real-world impact. Current research in the field mostly remains at the stage of model development and technical validation, and its potential clinical value has not yet been substantiated. Building on preliminary evidence, future research in this direction should address three translational aspects: (1) Prospective application and evaluation of prediction models to guide preoperative blood preparation. Although existing models show potential to optimize blood preparation strategies [,], their impact on resource conservation and team response efficiency after integration into actual workflows remains to be confirmed by prospective studies. (2) Generalizability and clinical integration of real-time alert models. Although some studies have successfully developed real-time prediction models for intraoperative massive transfusion and demonstrated excellent performance [], their generalizability across surgical types and medical centers, as well as their alert efficacy and clinical acceptance after integration into anesthesia monitoring systems, requires further validation. (3) The effect of model-based clinical decisions on hard patient end points, which is the most challenging aspect and requires prospective interventional trials. Existing observational studies suggest that transfusion is associated with worse outcomes and higher costs []. Well-designed studies are needed to confirm whether effective prediction-intervention strategies can reduce unnecessary transfusions and enable timely management of major bleeding, thereby lowering complications, improving patient prognosis, and saving medical costs.
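The net benefit used in decision curve analysis can be written as TP/n - (FP/n) x pt/(1 - pt), where pt is the threshold probability at which a clinician would act. The following minimal Python sketch computes it with the standard library for a model and for the "treat all" reference strategy. The 10-patient cohort, predicted risks, and thresholds are hypothetical and are not drawn from any included study.

```python
from typing import List


def _check_threshold(pt: float) -> None:
    """The threshold probability must lie strictly between 0 and 1."""
    if not 0.0 < pt < 1.0:
        raise ValueError("threshold probability must be in (0, 1)")


def net_benefit(y: List[int], p: List[float], pt: float) -> float:
    """Net benefit of acting on patients whose predicted risk is >= pt."""
    _check_threshold(pt)
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= pt)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= pt)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * (pt / (1.0 - pt))


def net_benefit_treat_all(y: List[int], pt: float) -> float:
    """Reference strategy: act on every patient regardless of predicted risk."""
    _check_threshold(pt)
    prevalence = sum(y) / len(y)
    return prevalence - (1.0 - prevalence) * (pt / (1.0 - pt))


# Hypothetical cohort: 10 patients, 3 major bleeding events.
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]

# "Treat none" always has a net benefit of 0 and is the second reference line.
for pt in (0.1, 0.25, 0.5):
    print(f"pt={pt}: model={net_benefit(y, p, pt):.3f}, "
          f"treat all={net_benefit_treat_all(y, pt):.3f}, treat none=0.000")
```

A model is clinically useful at a given threshold only if its net benefit exceeds both reference strategies, which is information that AUC alone does not provide.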
Limitations
This review has two main limitations. First, the search strategy may not have captured all relevant literature, posing a risk of omission. Second, and more critically, the conclusions are constrained by the methodological quality of the included original studies: the widespread retrospective design and high risk of bias in the current field necessitate cautious interpretation of the true performance and generalizability of the evaluated models.
Conclusion
This scoping review indicates that research on ML for predicting intraoperative bleeding is growing rapidly in quantity, but the quality of studies has not improved correspondingly, constituting the main obstacle to clinical translation. Existing models are generally built on retrospective data and suffer from core methodological flaws, including a high risk of bias, a severe lack of external validation, and incomplete reporting of key performance metrics. Therefore, the clinical applicability and reliability of current models are far from established. To achieve the leap from methodological exploration to clinical utility, future research must meet higher standards—prioritize prospective design, enforce independent and multicenter external validation, strictly adhere to standardized reporting guidelines such as TRIPOD, and strive to explore effective pathways for integrating models into perioperative workflows.
Acknowledgments
No generative AI tools were used at any stage in the preparation of this manuscript. All content, including text, data, analyses, references, and citations, was generated and reviewed entirely by the authors. We remain fully responsible for the accuracy, originality, and integrity of the manuscript. SY and PZ are co-first authors.
Funding
No external financial support or grants were received from any public, commercial, or not-for-profit entities for the research, authorship, or publication of this article.
Data Availability
This study is a scoping review and does not involve the generation or analysis of new data. All data used in this review were extracted from publicly available papers indexed in PubMed, Web of Science, Embase, CINAHL, CNKI, Wanfang, and VIP. No new datasets were created or analyzed in the course of this research. The studies included in this review can be accessed through their respective journals and databases.
Authors' Contributions
SY contributed to conceptualization, methodology, investigation, and writing—original draft.
PZ contributed to methodology, formal analysis, data curation, and writing—original draft.
LX and JJ contributed to conceptualization, supervision, and project administration.
SX and WQ contributed to investigation.
HH and YG contributed to formal analysis and data curation.
All authors participated in writing—review & editing and approved the final manuscript.
Conflicts of Interest
None declared.
References
- Shah A, Palmer AJR, Klein AA. Strategies to minimize intraoperative blood loss during major surgery. Br J Surg. Jan 2020;107(2):e26-e38.
- Lin YM, Yu C, Xian GZ. Calculation methods for intraoperative blood loss: a literature review. BMC Surg. Dec 20, 2024;24(1):394.
- Sieśkiewicz A, Reszeć J, Piszczatowski B, et al. Intraoperative bleeding during endoscopic sinus surgery and microvascular density of the nasal mucosa. Adv Med Sci. Mar 2014;59(1):132-135.
- Park J, Kwon JH, Lee SH, et al. Intraoperative blood loss may be associated with myocardial injury after non-cardiac surgery. PLOS ONE. 2021;16(2):e0241114.
- Shander A, Hardy JF, Ozawa S, et al. A global definition of patient blood management. Anesth Analg. Sep 1, 2022;135(3):476-488.
- Yoon D, Yoo M, Kim BS, et al. Automated deep learning model for estimating intraoperative blood loss using gauze images. Sci Rep. Jan 31, 2024;14(1):2597.
- Kirchhoff P, Clavien PA, Hahnloser D. Complications in colorectal surgery: risk factors and preventive strategies. Patient Saf Surg. Mar 25, 2010;4(1):5.
- Mahmood E, Matyal R, Mueller A, et al. Multifactorial risk index for prediction of intraoperative blood transfusion in endovascular aneurysm repair. J Vasc Surg. Mar 2018;67(3):778-784.
- Alali AA, Boustany A, Martel M, Barkun AN. Strengths and limitations of risk stratification tools for patients with upper gastrointestinal bleeding: a narrative review. Expert Rev Gastroenterol Hepatol. 2023;17(8):795-803.
- Akazawa M, Hashimoto K. A multimodal deep learning model for predicting severe hemorrhage in placenta previa. Sci Rep. Oct 13, 2023;13(1):17320.
- Eckhardt CM, Madjarova SJ, Williams RJ, et al. Unsupervised machine learning methods and emerging applications in healthcare. Knee Surg Sports Traumatol Arthrosc. Feb 2023;31(2):376-381.
- Chen X, Zhang H, Guo D, et al. Risk of intraoperative hemorrhage during cesarean scar ectopic pregnancy surgery: development and validation of an interpretable machine learning prediction model. EClinicalMedicine. Dec 2024;78:102969.
- Shi X, Cui Y, Wang S, Pan Y, Wang B, Lei M. Development and validation of a web-based artificial intelligence prediction model to assess massive intraoperative blood loss for metastatic spinal disease using machine learning techniques. Spine J. Jan 2024;24(1):146-160.
- Tricco AC, Lillie E, Zarin W, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med. Oct 2, 2018;169(7):467-473.
- Arksey H, O’Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol. Feb 2005;8(1):19-32.
- Moons KGM, de Groot JAH, Bouwmeester W, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLOS Med. Oct 2014;11(10):e1001744.
- Kaul T, Damen JAA, Wynants L, et al. Assessing the quality of prediction models in health care using the Prediction model Risk Of Bias ASsessment Tool (PROBAST): an evaluation of its use and practical application. J Clin Epidemiol. May 2025;181:111732.
- Akazawa M, Hashimoto K. Prediction of hemorrhage in placenta previa: radiomics analysis of pelvic MRI images. Eur J Obstet Gynecol Reprod Biol. Aug 2024;299:37-42.
- de Reus DC, Kuijten RH, Saha P, et al. External validation of a machine learning prediction model for massive blood loss during surgery for spinal metastases: a multi-institutional study using 880 patients. Spine J. Jul 2025;25(7):1386-1399.
- Li J, Jia YM, Zhang ZL, et al. Development and validation of a machine learning-based early prediction model for massive intraoperative bleeding in patients with primary hepatic malignancies. World J Gastrointest Oncol. Jan 15, 2024;16(1):90-101.
- Liu J, Wu T, Peng Y, Luo R. Grade prediction of bleeding volume in cesarean section of patients with pernicious placenta previa based on deep learning. Front Bioeng Biotechnol. 2020;8:343.
- Mo J, Huang JY, Liu H, et al. Based on the deep neural network principle, construct prediction model for massive bleeding risk during hysteroscopic removal of scar gestation. Chin J Fam Plann Gynecol. 2023;15(3):72-77.
- Park S, Park K, Lee JG, et al. Development of machine learning models predicting estimated blood loss during liver transplant surgery. J Pers Med. Jun 23, 2022;12(7):1028.
- Shi Y, Zhang G, Ma C, et al. Machine learning algorithms to predict intraoperative hemorrhage in surgical patients: a modeling study of real-world data in Shanghai, China. BMC Med Inform Decis Mak. Aug 10, 2023;23(1):156.
- Stehrer R, Hingsammer L, Staudigl C, et al. Machine learning based prediction of perioperative blood loss in orthognathic surgery. J Craniomaxillofac Surg. Nov 2019;47(11):1676-1681.
- Sun Z, Yang N, Wang L, Zhou J, Zhang H, Wang J. Constructing a predictive model for high intraoperative excessive bleeding in patients undergoing posterior lumbar decompression and fusion internal fixation surgery during outpatient visits. Clin Biochem. Jan 2025;135:110856.
- Wang YC. Value of MRI-Based Radiomics in Diagnosing Placenta Accreta Spectrum Disorders and Predicting Blood Loss during Cesarean Section. Gansu University of Chinese Medicine; 2023.
- Wakiya T, Ishido K, Kimura N, et al. Prediction of massive bleeding in pancreatic surgery based on preoperative patient characteristics using a decision tree. PLOS ONE. 2021;16(11):e0259682.
- Xu XY. Value of T2WI-based deep learning and radiomics models in predicting blood loss risk during cesarean section for placenta accreta spectrum disorders. Guangdong Medical University; 2024. URL: https://d.wanfangdata.com.cn/thesis/Ch1UaGVzaXNOZXdTb2xyOVMyMDI2MDQxNTE0Mjg1MRIJRDAzNDg1NDIzGghla3o0bGZoaA%3D%3D [Accessed 2026-04-22]
- Xue Q, Zhu Y, Yang L, et al. Predicting intraoperative bleeding in patients undergoing a hepatectomy using multiple machine learning and deep learning techniques. J Clin Anesth. Nov 2021;74:110444.
- Yang B, Gao L, Wang X, et al. Application of supervised machine learning algorithms to predict the risk of hidden blood loss during the perioperative period in thoracolumbar burst fracture patients complicated with neurological compromise. Front Public Health. 2022;10:969919.
- Yang H, Wu X, Liu W, et al. A quantitative analysis framework of placenta accreta spectrum: placenta subtype, intraoperative bleeding, and hysterectomy risk evaluation based on magnetic resonance imaging-anatomical-clinical features. Quant Imaging Med Surg. Oct 1, 2023;13(10):7105-7116.
- Yin P, Sun C, Wang S, Chen L, Hong N. Clinical-deep neural network and clinical-radiomics nomograms for predicting the intraoperative massive blood loss of pelvic and sacral tumors. Front Oncol. 2021;11:752672.
- Zheng C, Yue P, Cao K, et al. Predicting intraoperative blood loss during cesarean sections based on multi-modal information: a two-center study. Abdom Radiol (NY). Jul 2024;49(7):2325-2339.
- Zheng Y, Wu CX, Yao ZX. Development of a hemorrhage prediction model for hepatectomy based on machine learning and preoperative data. Fujian Med Univ J. 2022;56(6):552-560. URL: http://kns--cnki--net--https.cnki.scrm.scsycy.vip:2222/kcms2/article/abstract?v=9sxxlkkbz5M8EE0nVpwqU-q3rja4vJevci2btiIU5sY8j8KErSyAfQCankDdkEL869MGzKbdF5MTYxkMF1NMbha1qC-G_Y_dJLGe03VDKwenUTgMP-T3x_vNC6pCOeiGLgA8Phc-aU2bzCqyE8dbObWAR7g5fXstcqkzzR5jlGg-B9EcIqGbp6qXjF9nkoIl&uniplatform=NZKPT&language=CHS [Accessed 2026-04-22]
- Zong M, Pei X, Yan K, et al. Deep learning model based on multisequence MRI images for assessing adverse pregnancy outcome in placenta accreta. J Magn Reson Imaging. Feb 2024;59(2):510-521.
- Li H. Analysis of Risk Factors and Development of a Predictive Model for Massive Hemorrhage during Cesarean Section in Women with Placenta Percreta. Jilin University; 2024.
- Chen S, Guo X. A review of multimodal large models in the field of major diseases [Chinese]. Journal of Harbin Institute of Technology. Dec 15, 2025;57(12):156-164. URL: https://scholar.hit.edu.cn/en/publications/%E5%A4%9A%E6%A8%A1%E6%80%81%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%9C%A8%E9%87%8D%E5%A4%A7%E7%96%BE%E7%97%85%E9%A2%86%E5%9F%9F%E7%9A%84%E7%A0%94%E7%A9%B6%E7%BB%BC%E8%BF%B0/ [Accessed 2026-04-06]
- Alba AC, Agoritsas T, Walsh M, et al. Discrimination and calibration of clinical prediction models: users’ guides to the medical literature. JAMA. Oct 10, 2017;318(14):1377-1384.
- Moons KGM, Damen JAA, Kaul T, et al. PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ. Mar 24, 2025;388:e082505.
- Lou SS, Liu H, Lu C, Wildes TS, Hall BL, Kannampallil T. Personalized surgical transfusion risk prediction using machine learning to guide preoperative type and screen orders. Anesthesiology. Jul 1, 2022;137(1):55-66.
- Zapf MAC, Fabbri DV, Andrews J, et al. Development of a machine learning model to predict intraoperative transfusion and guide type and screen ordering. J Clin Anesth. Dec 2023;91:111272.
- Lee SM, Lee G, Kim TK, et al. Development and validation of a prediction model for need for massive transfusion during surgery using intraoperative hemodynamic monitoring data. JAMA Netw Open. Dec 1, 2022;5(12):e2246637.
- Lang FF, Liu LY, Wang SW. Predictive modeling of perioperative blood transfusion in lumbar posterior interbody fusion using machine learning. Front Physiol. 2023;14:1306453.
Abbreviations
| AUC: area under the curve |
| CHARMS: Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies |
| CNKI: China National Knowledge Infrastructure |
| EHR: electronic health record |
| ML: machine learning |
| MRI: magnetic resonance imaging |
| PICOS: Population, Intervention, Comparator, Outcome, and Study design |
| PRISMA-ScR: Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews |
| PROBAST: Prediction Model Risk Of Bias Assessment Tool |
| SHAP: Shapley Additive Explanations |
| SMOTE: synthetic minority oversampling technique |
| TRIPOD: Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis |
| VIP: China Science and Technology Journal Database |
Edited by Arriel Benis; submitted 18.Jul.2025; peer-reviewed by Juan-Jose Beunza, Suhila Sawesi; final revised version received 08.Jan.2026; accepted 08.Jan.2026; published 10.Jun.2026.
Copyright© Shiqiong Yan, Ping Zhang, Wanwan Qiao, Sijia Xie, Huan Hu, Yi Gao, Linli Xie, Jie Jing. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 10.Jun.2026.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

